HEAD
Origin
The present dataframe was created for marketing analysis purposes. It assembles various personal information about 2240 customers, such as their education level, income, age, marital status, and number of children at home.
It also shows their consuming habits (amount spent on wine, on sweets…) and the number of purchases made on discounted products.
There is very little context for this dataframe, since the source is unknown. It is not clear when the data was collected, but probably around 2014, since no customer enrollment date is later than 2014.
Aims
To predict customer behavior (number of purchases made with a discount) from the most significant personal attributes
To categorize customers into a few typical profiles (probably with PCA)
Attributes
ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise
MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years
NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
NumStorePurchases: Number of purchases made directly in stores
Before loading the dataset, we want to make sure we have all the necessary packages installed and loaded, and that the code can be run by anybody.
if(!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}
suppressPackageStartupMessages(pacman::p_load(tidyverse, gtsummary, ggpubr, moments, here, sjPlot, parameters, effectsize, pander, psych))
path = here("JULIETTE")
setwd(path)
data <- read.table("marketing_campaign.csv", header=T, sep="\t")
str(data)
## 'data.frame': 2240 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
pander(summary(data))
| ID | Year_Birth | Education | Marital_Status |
|---|---|---|---|
| Min. : 0 | Min. :1893 | Length:2240 | Length:2240 |
| 1st Qu.: 2828 | 1st Qu.:1959 | Class :character | Class :character |
| Median : 5458 | Median :1970 | Mode :character | Mode :character |
| Mean : 5592 | Mean :1969 | NA | NA |
| 3rd Qu.: 8428 | 3rd Qu.:1977 | NA | NA |
| Max. :11191 | Max. :1996 | NA | NA |
| NA | NA | NA | NA |
| Income | Kidhome | Teenhome | Dt_Customer |
|---|---|---|---|
| Min. : 1730 | Min. :0.0000 | Min. :0.0000 | Length:2240 |
| 1st Qu.: 35303 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | Class :character |
| Median : 51382 | Median :0.0000 | Median :0.0000 | Mode :character |
| Mean : 52247 | Mean :0.4442 | Mean :0.5062 | NA |
| 3rd Qu.: 68522 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | NA |
| Max. :666666 | Max. :2.0000 | Max. :2.0000 | NA |
| NA’s :24 | NA | NA | NA |
| Recency | MntWines | MntFruits | MntMeatProducts |
|---|---|---|---|
| Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 |
| 1st Qu.:24.00 | 1st Qu.: 23.75 | 1st Qu.: 1.0 | 1st Qu.: 16.0 |
| Median :49.00 | Median : 173.50 | Median : 8.0 | Median : 67.0 |
| Mean :49.11 | Mean : 303.94 | Mean : 26.3 | Mean : 166.9 |
| 3rd Qu.:74.00 | 3rd Qu.: 504.25 | 3rd Qu.: 33.0 | 3rd Qu.: 232.0 |
| Max. :99.00 | Max. :1493.00 | Max. :199.0 | Max. :1725.0 |
| NA | NA | NA | NA |
| MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases |
|---|---|---|---|
| Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 |
| 1st Qu.: 3.00 | 1st Qu.: 1.00 | 1st Qu.: 9.00 | 1st Qu.: 1.000 |
| Median : 12.00 | Median : 8.00 | Median : 24.00 | Median : 2.000 |
| Mean : 37.53 | Mean : 27.06 | Mean : 44.02 | Mean : 2.325 |
| 3rd Qu.: 50.00 | 3rd Qu.: 33.00 | 3rd Qu.: 56.00 | 3rd Qu.: 3.000 |
| Max. :259.00 | Max. :263.00 | Max. :362.00 | Max. :15.000 |
| NA | NA | NA | NA |
| NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth |
|---|---|---|---|
| Min. : 0.000 | Min. : 0.000 | Min. : 0.00 | Min. : 0.000 |
| 1st Qu.: 2.000 | 1st Qu.: 0.000 | 1st Qu.: 3.00 | 1st Qu.: 3.000 |
| Median : 4.000 | Median : 2.000 | Median : 5.00 | Median : 6.000 |
| Mean : 4.085 | Mean : 2.662 | Mean : 5.79 | Mean : 5.317 |
| 3rd Qu.: 6.000 | 3rd Qu.: 4.000 | 3rd Qu.: 8.00 | 3rd Qu.: 7.000 |
| Max. :27.000 | Max. :28.000 | Max. :13.00 | Max. :20.000 |
| NA | NA | NA | NA |
| AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 |
|---|---|---|---|
| Min. :0.00000 | Min. :0.00000 | Min. :0.00000 | Min. :0.00000 |
| 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.00000 |
| Median :0.00000 | Median :0.00000 | Median :0.00000 | Median :0.00000 |
| Mean :0.07277 | Mean :0.07455 | Mean :0.07277 | Mean :0.06429 |
| 3rd Qu.:0.00000 | 3rd Qu.:0.00000 | 3rd Qu.:0.00000 | 3rd Qu.:0.00000 |
| Max. :1.00000 | Max. :1.00000 | Max. :1.00000 | Max. :1.00000 |
| NA | NA | NA | NA |
| AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|
| Min. :0.00000 | Min. :0.000000 | Min. :3 | Min. :11 | Min. :0.0000 |
| 1st Qu.:0.00000 | 1st Qu.:0.000000 | 1st Qu.:3 | 1st Qu.:11 | 1st Qu.:0.0000 |
| Median :0.00000 | Median :0.000000 | Median :3 | Median :11 | Median :0.0000 |
| Mean :0.01339 | Mean :0.009375 | Mean :3 | Mean :11 | Mean :0.1491 |
| 3rd Qu.:0.00000 | 3rd Qu.:0.000000 | 3rd Qu.:3 | 3rd Qu.:11 | 3rd Qu.:0.0000 |
| Max. :1.00000 | Max. :1.000000 | Max. :3 | Max. :11 | Max. :1.0000 |
| NA | NA | NA | NA | NA |
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free") +
geom_histogram()
We are only interested in the total number of promotions accepted by the customers, since we don’t have details about the nature of each promotion.
data$AcceptedCmpTotal <- data$AcceptedCmp1 + data$AcceptedCmp2 + data$AcceptedCmp3 + data$AcceptedCmp4 + data$AcceptedCmp5 + data$Response
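The same total can be written more compactly with rowSums() over the six campaign columns; a minimal sketch on a synthetic two-row frame (the toy values are made up for illustration):

```r
# Equivalent total using rowSums(); the toy data frame is synthetic.
toy <- data.frame(AcceptedCmp1 = c(1, 0), AcceptedCmp2 = c(0, 0),
                  AcceptedCmp3 = c(1, 0), AcceptedCmp4 = c(0, 1),
                  AcceptedCmp5 = c(0, 0), Response     = c(1, 0))
toy$AcceptedCmpTotal <- rowSums(toy[, c(paste0("AcceptedCmp", 1:5), "Response")])
toy$AcceptedCmpTotal  # 3 1
```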
The dataframe contains many variables; some are superfluous for our analysis (web visits and purchases, complaints, catalog purchases, and Z_Revenue and Z_CostContact, about which we have no information).
data$Complain <- data$NumWebVisitsMonth <- data$NumWebPurchases <- data$NumCatalogPurchases <- data$Z_Revenue <- data$Z_CostContact <- data$AcceptedCmp1 <- data$AcceptedCmp2 <- data$AcceptedCmp3 <- data$AcceptedCmp4 <- data$AcceptedCmp5 <- data$Response <- NULL
We want to calculate the age of the customers. If we computed "2021 - data$Year_Birth", we would get their current age. It makes more sense to use their age at the moment the data was registered, so we compute "2014 - data$Year_Birth", although we are only assuming that the data was indeed registered in 2014.
data$age <- 2014 - data$Year_Birth
plot(data$age)
We see 3 outliers who seem to be older than 110. The corresponding birth years are 1893, 1900 and 1899. The first could be a typo for 1993; the second would require two typing errors, which is improbable; and the third could be replaced by 1999, but it belongs to someone with a PhD, which is unlikely at age 15. Since the dataset is large, we choose to delete these rows.
# which(data$age>110)
data <- data[-c(193, 240, 340),]
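A hypothetical, more robust variant would filter on the age condition itself rather than on hard-coded row indices, so the code still works if the row order changes; sketched on a toy frame:

```r
# Filter on the condition instead of hard-coded row numbers
# (toy ages are synthetic, for illustration).
toy <- data.frame(age = c(45, 121, 30, 114))
toy <- toy[toy$age <= 110, ]
nrow(toy)  # 2
```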
Marital status can be simplified into only a few levels and transformed into a factor. Since the "Other" category represents less than 1% of the participants, there is not enough information to model it as a separate level.
We then transform some relevant variables into factors.
data$Marital_Status <- factor(data$Marital_Status, labels = c("Other", "Single", "Single", "Married", "Single", "Together", "Single", "Other"))
data$Marital_Status[data$Marital_Status=="Other"] <- NA; data$Marital_Status = droplevels(data$Marital_Status)
data$ID <- factor(data$ID)
data$Education <- factor(data$Education)
data$Teenhome <- factor(data$Teenhome)
data$AcceptedCmpTotal <- factor(data$AcceptedCmpTotal)
data$Kidhome <- factor(data$Kidhome, labels = c("no", "yes", "yes"))
For Kidhome, we merged the answers "1" and "2" because only 2% of customers answered "2", which is not enough information to model it as a separate level.
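Note that the labels= call above relies on the alphabetical order of the underlying levels ("0", "1", "2"). A sketch of a more explicit alternative, using a named level mapping (toy values for illustration):

```r
# Collapse levels "1" and "2" into "yes" with an explicit named mapping,
# which does not depend on remembering the level order.
kid <- factor(c(0, 2, 1, 0))
levels(kid) <- list(no = "0", yes = c("1", "2"))
as.character(kid)  # "no" "yes" "yes" "no"
```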
We now want to plot all the variables again, and check again whether anything is abnormal.
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free") +
geom_histogram()
Here we will be eyeballing the relationships between various variables, and determining those that seem most relevant for further analysis.
data %>%
filter(!is.na(Marital_Status)) %>%
ggplot(aes(AcceptedCmpTotal, Income)) + geom_violin(trim=FALSE, fill='#A4A4A4', color="darkred")
ggplot(data, aes(Education, MntWines)) + geom_boxplot(outlier.colour = "red") + geom_point(position = position_jitter(), color="coral4")
ggplot(data, aes(x=Income, y=NumStorePurchases)) +
geom_violin(trim=FALSE, fill='#A4A4A4', color="darkred")+
theme_minimal()
NumStorePurchases seems to be the most useful variable to analyse, since marketing analysis might want to determine which profiles buy the most in store. In order to use NumStorePurchases as the response variable of a linear model, we first have to check normality.
In this case, the data is intrinsically skewed because negative values are impossible. Skewness is quite high (0.7) but remains tolerable given the nature of the data. This has to be kept in mind while performing the model assumption checks.
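As a quick sanity check of how moments::skewness() behaves on intrinsically non-negative data, here is a minimal sketch on a simulated right-skewed sample (not the marketing data):

```r
# An exponential sample is non-negative and right-skewed, so its
# sample skewness should be clearly positive (around 2 in theory).
set.seed(1)
x <- rexp(1000)
moments::skewness(x) > 0  # TRUE
```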
ggplot(data, aes(x = NumStorePurchases)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density(adjust = 1) + labs(title = "Purchases",
caption = paste("skewness =", round(moments::skewness(data$NumStorePurchases, na.rm = TRUE),2)))
m1 <- lm(data=data, NumStorePurchases ~ Kidhome*Income*Education*age)
plot(m1, c(1:2,4), ask=F)
Looking at the QQ plot, we don't see any strong pattern, which suggests the residuals are roughly normally distributed, although the tails fall short for the lowest theoretical quantiles.
For the Cook's distance, we clearly observe an outlier (observation 2234), while the vast majority of values are well under 1.
For the comparison of residuals vs. fitted values, we first look for outliers in the predictors:
ggplot(data, aes(NumStorePurchases, Income)) + geom_point(color="blue", alpha=0.3, position = position_jitter())
Here we observe peculiar outliers for Income. One income is equal to 666,666, and customers with incomes above 150,000 did not respond. We choose to remove these customers' rows because they probably did not report their true income.
newdf = data %>%
filter(!ID %in% c(9432, 5555, 4619, 5336, 1501, 1503, 8475, 4931, 11181) )
m1 <- lm(data=newdf, NumStorePurchases ~ Kidhome*Income*Education*age)
plot(m1, c(1:2,4), ask=F)
Paying attention to the QQ plot, the extreme points are now closer to normal. The Cook's distance plot also shows no values higher than 0.5.
Here, we use the stepAIC function to select which terms to keep, choosing the model with the best AIC.
ms <- MASS::stepAIC(m1, direction = "both", trace = FALSE)
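To illustrate that stepAIC never returns a model that is worse by this criterion, a minimal sketch on simulated data (the names full and sel are hypothetical):

```r
# stepAIC drops terms only when doing so lowers (improves) the AIC,
# so the selected model's AIC is at most the full model's.
set.seed(1)
toy  <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
full <- lm(y ~ x1 * x2, data = toy)
sel  <- MASS::stepAIC(full, direction = "both", trace = FALSE)
AIC(sel) <= AIC(full)  # TRUE
```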
ms$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## NumStorePurchases ~ Kidhome * Income * Education * age
##
## Final Model:
## NumStorePurchases ~ Kidhome + Income + Education + age + Kidhome:Income +
## Kidhome:Education + Income:Education + Kidhome:age + Income:age +
## Education:age + Kidhome:Income:Education + Income:Education:age
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 2164 11138.08 3650.691
## 2 - Kidhome:Income:Education:age 4 26.205697 2168 11164.28 3647.870
## 3 - Kidhome:Education:age 4 14.097325 2172 11178.38 3642.652
## 4 - Kidhome:Income:age 1 8.240739 2173 11186.62 3642.276
The final model chosen here is:
finalm1 <- lm(data=newdf,NumStorePurchases ~ Kidhome + Income + Education + age + Kidhome:Income + Kidhome:Education + Income:Education + Kidhome:age + Income:age + Education:age + Kidhome:Income:Education + Income:Education:age)
We first use the eta_squared function to compute the effect sizes: every term whose 90% CI has 0.00 on its left end has a negligible effect size, but we still keep it in the model.
parameters::model_parameters(anova(finalm1))
## Parameter | Sum_Squares | df | Mean_Square | F | p
## ------------------------------------------------------------------------------
## Kidhome | 6242.08 | 1 | 6242.08 | 1212.52 | < .001
## Income | 5446.16 | 1 | 5446.16 | 1057.92 | < .001
## Education | 10.29 | 4 | 2.57 | 0.50 | 0.736
## age | 24.65 | 1 | 24.65 | 4.79 | 0.029
## Kidhome:Income | 4.87 | 1 | 4.87 | 0.95 | 0.331
## Kidhome:Education | 31.63 | 4 | 7.91 | 1.54 | 0.189
## Income:Education | 26.11 | 4 | 6.53 | 1.27 | 0.280
## Kidhome:age | 36.69 | 1 | 36.69 | 7.13 | 0.008
## Income:age | 5.97 | 1 | 5.97 | 1.16 | 0.282
## Education:age | 11.41 | 4 | 2.85 | 0.55 | 0.696
## Kidhome:Income:Education | 52.82 | 4 | 13.20 | 2.56 | 0.037
## Income:Education:age | 49.16 | 4 | 12.29 | 2.39 | 0.049
## Residuals | 11186.62 | 2173 | 5.15 | |
##
## Anova Table (Type 1 tests)
effectsize::eta_squared(car::Anova(finalm1, type = 2), ci = 0.9, alternative = "two")
## # Effect Size for ANOVA (Type II)
##
## Parameter | Eta2 (partial) | 90% CI
## --------------------------------------------------------
## Kidhome | 0.05 | [0.04, 0.07]
## Income | 0.31 | [0.29, 0.34]
## Education | 4.56e-04 | [0.00, 0.00]
## age | 2.42e-03 | [0.00, 0.01]
## Kidhome:Income | 1.20e-03 | [0.00, 0.00]
## Kidhome:Education | 2.86e-03 | [0.00, 0.01]
## Income:Education | 2.28e-03 | [0.00, 0.01]
## Kidhome:age | 3.98e-03 | [0.00, 0.01]
## Income:age | 7.07e-04 | [0.00, 0.00]
## Education:age | 1.07e-03 | [0.00, 0.00]
## Kidhome:Income:Education | 5.30e-03 | [0.00, 0.01]
## Income:Education:age | 4.38e-03 | [0.00, 0.01]
We call the sjPlot plot_model function to plot all the estimates, or only one term at a time.
sjPlot::plot_model(finalm1)
sjPlot::plot_model(finalm1, type = "pred", terms = "Kidhome", show.data = T, jitter = 1)
sjPlot::plot_model(finalm1, type = "pred", terms = "Income", show.data = T, jitter = 1)
sjPlot::plot_model(finalm1, type = "pred", terms = c("Income", "Kidhome"), show.data = T, jitter = 1)
sjPlot::plot_model(finalm1, type = "pred", terms = c("Income", "Education"), show.data = T, jitter = 1)
sjPlot::plot_model(finalm1, type = "pred", terms = c("Income", "age"), show.data = T, jitter = 1)
sjPlot::plot_model(finalm1, type = "pred", terms = c("age", "Kidhome"), show.data = T, jitter = 1)
sjPlot::plot_model(finalm1, type = "pred", terms = c("age", "Income", "Education"), show.data = T, jitter = 1)
As we could easily assume, the number of store purchases increases with income. A more counter-intuitive finding is that it is lower for customers who have kids; this could be because the dataframe does not account for the store purchases made by the other parent.
Looking at the interaction between education and income, we see that for the same income, customers who received a basic education are clearly separated from the others by purchasing less.
For the interaction between income and age, an interesting pattern appears: for the lowest incomes, the number of store purchases decreases with age, while the tendency is reversed for the highest incomes.
We also observe an interaction between having kids at home and age: the number of store purchases stays stable for customers without kids at home, while it decreases for those with kids at home, possibly because kids leave the house as the parents get older.
Finally, across ages, the increase of store purchases with income stays stable for every education level except basic education.
We want a table that summarizes all the standardized estimates in numeric form.
sjPlot::tab_model(effectsize::standardize(finalm1), rm.terms = c("*Education.Q", "Education^4", "Income:Education.C", "Education.Q", "Kidhomeyes:Education.Q", "Income:Education.Q", "Kidhomeyes:Income:Education.Q", "Education.C", "Kidhomeyes:Education.C", "Income:Education.C", "Kidhomeyes:Income:Education.C", "Education^4", "Kidhomeyes:Education^4", "Income:Education^4", "Kidhomeyes:Income:Education^4"), show.intercept = F)
| NumStorePurchases | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| Kidhome [yes] | -0.58 | -0.83 – -0.32 | <0.001 |
| Income | 0.61 | 0.48 – 0.74 | <0.001 |
| Education [Basic] | -0.62 | -2.03 – 0.79 | 0.391 |
| Education [Graduation] | -0.07 | -0.23 – 0.08 | 0.356 |
| Education [Master] | -0.11 | -0.29 – 0.08 | 0.254 |
| Education [PhD] | 0.04 | -0.14 – 0.22 | 0.663 |
| age | 0.03 | -0.09 – 0.14 | 0.668 |
| Kidhome [yes] * Income | -0.11 | -0.35 – 0.13 | 0.383 |
| Kidhome [yes] * Education [Basic] | -0.76 | -3.00 – 1.47 | 0.503 |
| Kidhome [yes] * Education [Graduation] | 0.17 | -0.10 – 0.45 | 0.214 |
| Kidhome [yes] * Education [Master] | 0.30 | -0.01 – 0.61 | 0.061 |
| Kidhome [yes] * Education [PhD] | 0.09 | -0.22 – 0.40 | 0.555 |
| Income * Education [Basic] | -0.32 | -1.28 – 0.65 | 0.522 |
| Income * Education [Graduation] | -0.03 | -0.18 – 0.12 | 0.684 |
| Income * Education [Master] | 0.01 | -0.16 – 0.18 | 0.894 |
| Income * Education [PhD] | -0.17 | -0.34 – -0.00 | 0.048 |
| Kidhome [yes] * age | -0.12 | -0.20 – -0.04 | 0.003 |
| Income * age | 0.03 | -0.07 – 0.13 | 0.570 |
| Education [Basic] * age | -0.66 | -1.79 – 0.47 | 0.253 |
| Education [Graduation] * age | -0.03 | -0.15 – 0.09 | 0.644 |
| Education [Master] * age | 0.02 | -0.12 – 0.16 | 0.761 |
| Education [PhD] * age | -0.06 | -0.20 – 0.07 | 0.342 |
| (Kidhome [yes] * Income) * Education [Basic] | -0.63 | -2.07 – 0.81 | 0.392 |
| (Kidhome [yes] * Income) * Education [Graduation] | 0.12 | -0.14 – 0.38 | 0.374 |
| (Kidhome [yes] * Income) * Education [Master] | 0.23 | -0.08 – 0.54 | 0.153 |
| (Kidhome [yes] * Income) * Education [PhD] | 0.44 | 0.13 – 0.76 | 0.005 |
| (Income * Education [Basic]) * age | -0.47 | -1.19 – 0.26 | 0.209 |
| (Income * Education [Graduation]) * age | -0.09 | -0.20 – 0.02 | 0.113 |
| (Income * Education [Master]) * age | -0.04 | -0.16 – 0.09 | 0.574 |
| (Income * Education [PhD]) * age | 0.02 | -0.10 – 0.15 | 0.716 |
| Observations | 2204 | ||
| R2 / R2 adjusted | 0.516 / 0.510 | ||
We see that R^2 and adjusted R^2 have acceptable values.
clean_data <-data[rowSums(is.na(data))==0, ]
pm1<-prcomp(clean_data[,-c(1,3,4,6,7,8,18)], scale=TRUE)
summary(pm1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0965 1.4242 1.04714 1.00079 0.90502 0.80789 0.70029
## Proportion of Variance 0.3663 0.1690 0.09138 0.08346 0.06826 0.05439 0.04087
## Cumulative Proportion 0.3663 0.5353 0.62667 0.71013 0.77839 0.83278 0.87364
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.65473 0.64171 0.61837 0.54168 6.551e-15
## Proportion of Variance 0.03572 0.03432 0.03187 0.02445 0.000e+00
## Cumulative Proportion 0.90937 0.94368 0.97555 1.00000 1.000e+00
Looking at the cumulative proportion, we need 8 principal components to reach a 0.90 proportion of variance.
Fitting a linear model m2 on the first 8 components, we see that PC6 is non-significant. We delete PC6 from the model; the resulting model m3 is not significantly different from m2, so we keep the simpler m3. With summary(m3), we see that all remaining PCs are significant.
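The component count can also be derived programmatically from the PCA object instead of being read off the summary; a sketch on simulated data (the real pm1 would be used the same way):

```r
# Number of components needed to reach 90% of the total variance,
# computed from the standard deviations stored in the prcomp object.
set.seed(1)
toy <- matrix(rnorm(200), ncol = 4)        # stand-in for the scaled data
p   <- prcomp(toy, scale. = TRUE)
cum <- cumsum(p$sdev^2) / sum(p$sdev^2)    # cumulative proportion of variance
n90 <- which(cum >= 0.90)[1]
```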
data2 <- cbind(clean_data, pm1$x)
m2 <- lm(data=data2, NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8)
summary(m2)
##
## Call:
## lm(formula = NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 +
## PC6 + PC7 + PC8, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2939 -0.3155 -0.0040 0.3936 5.8489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.80444 0.02159 268.817 < 2e-16 ***
## PC1 -1.16939 0.01030 -113.515 < 2e-16 ***
## PC2 0.12699 0.01517 8.374 < 2e-16 ***
## PC3 -0.86347 0.02063 -41.865 < 2e-16 ***
## PC4 -0.12584 0.02158 -5.831 6.32e-09 ***
## PC5 -0.69782 0.02386 -29.242 < 2e-16 ***
## PC6 -0.02038 0.02673 -0.762 0.44602
## PC7 -2.15573 0.03084 -69.899 < 2e-16 ***
## PC8 -0.10767 0.03299 -3.264 0.00112 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.015 on 2200 degrees of freedom
## Multiple R-squared: 0.9031, Adjusted R-squared: 0.9027
## F-statistic: 2562 on 8 and 2200 DF, p-value: < 2.2e-16
m3 <- update(m2, . ~ . - PC6)
anova(m3, m2)
## Analysis of Variance Table
##
## Model 1: NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + PC8
## Model 2: NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 +
## PC8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2201 2266.4
## 2 2200 2265.8 1 0.59833 0.5809 0.446
summary(m3)
##
## Call:
## lm(formula = NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 +
## PC7 + PC8, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2783 -0.3167 -0.0046 0.3917 5.8675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.80444 0.02159 268.843 < 2e-16 ***
## PC1 -1.16939 0.01030 -113.525 < 2e-16 ***
## PC2 0.12699 0.01516 8.374 < 2e-16 ***
## PC3 -0.86347 0.02062 -41.869 < 2e-16 ***
## PC4 -0.12584 0.02158 -5.832 6.3e-09 ***
## PC5 -0.69782 0.02386 -29.245 < 2e-16 ***
## PC7 -2.15573 0.03084 -69.906 < 2e-16 ***
## PC8 -0.10767 0.03298 -3.264 0.00111 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.015 on 2201 degrees of freedom
## Multiple R-squared: 0.903, Adjusted R-squared: 0.9027
## F-statistic: 2928 on 7 and 2201 DF, p-value: < 2.2e-16
sjPlot::tab_pca(pm1, show.cronb = F, show.var = T, nmbr.fctr = 8)
| Component 1 | Component 2 | Component 3 | Component 4 | Component 5 | Component 6 | Component 7 | Component 8 | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 0.00 | -1.00 | 0.02 | -0.01 | 0.05 | -0.02 | 0.06 | -0.00 |
| Income | -0.14 | 0.12 | 0.05 | -0.02 | -0.81 | 0.10 | -0.29 | 0.20 |
| Recency | 0.00 | 0.01 | -0.00 | 1.00 | -0.00 | 0.01 | -0.00 | 0.01 |
| MntWines | -0.15 | 0.09 | 0.00 | 0.02 | -0.52 | 0.18 | -0.70 | 0.01 |
| MntFruits | -0.86 | -0.01 | 0.06 | -0.01 | -0.15 | 0.11 | -0.23 | 0.12 |
| MntMeatProducts | -0.52 | -0.02 | 0.04 | 0.03 | -0.71 | 0.09 | -0.17 | 0.10 |
| MntFishProducts | -0.71 | 0.02 | 0.08 | -0.01 | -0.26 | 0.22 | -0.13 | 0.31 |
| MntSweetProducts | -0.40 | -0.01 | 0.05 | 0.02 | -0.24 | 0.12 | -0.19 | 0.83 |
| MntGoldProds | -0.23 | 0.03 | -0.04 | 0.01 | -0.15 | 0.93 | -0.18 | 0.10 |
| NumDealsPurchases | 0.09 | 0.04 | -0.99 | 0.00 | 0.05 | 0.04 | -0.05 | -0.04 |
| NumStorePurchases | -0.28 | 0.08 | -0.07 | -0.01 | -0.19 | 0.12 | -0.85 | 0.20 |
| age | -0.00 | 1.00 | -0.02 | 0.01 | -0.05 | 0.02 | -0.06 | 0.00 |
| Proportion of Variance | 36.63 % | 16.90 % | 9.14 % | 8.35 % | 6.83 % | 5.44 % | 4.09 % | 3.57 % |
| Cumulative Proportion | 36.63 % | 53.53 % | 62.67 % | 71.01 % | 77.84 % | 83.28 % | 87.36 % | 90.94 % |
| varimax-rotation | ||||||||
Conclusions
We need 8 principal components to attain a 0.90 cumulative proportion of variance.